Factors and the Extent of Their Impact on House Prices in Saratoga, New York

Group L04G03

Charlie Tran, Matthew Tang, Kevin Li, Richard Erren and Faiyad Ahmed

The University of Sydney

Introduction / Problem Statement

In recent years, the housing market has become a central topic of interest as property prices have skyrocketed across the United States. New York is now among the ten most expensive cities in the world. This has raised an important question for homeowners, real estate investors, and city planners alike:

Note

What are the key factors that drive the value of a home?

Idealization Process

Data Description

Data source: Houses in Saratoga County, New York, USA, in 2006

Data structure: 1734 observations on 17 variables. The Test variable was ignored because its meaning was unknown.

- Price = price of the house

- Lot.Size = size of the house’s lot in acres

- Age = age of the house in years

- Land.Value = value of land (in USD)

- Living.Area = living area in square feet

- Pct.College = percentage of neighborhood that graduated college

- Bedrooms = number of bedrooms

- Fireplaces = number of fireplaces

- Bathrooms = number of bathrooms

- Rooms = number of rooms

- Heating.Type = type of heating system

- Fuel.Type = type of fuel used for heating

- Sewer.Type = type of sewer system

- Waterfront = whether the property includes waterfront

- New.Construct = whether the property is a new construction

- Central.Air = whether the house has central air

Data selection

In our project statement, Price is the dependent variable to be predicted and all other variables are considered independent variables.

Based on the Heat Map, Full Model and Stepwise Model, we chose these variables as independent variables: Lot.Size, Waterfront, Land.Value, New.Construct, Living.Area, Bathrooms

Heat Map
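As a minimal sketch of what sits behind a heat map like this, assuming it displays pairwise Pearson correlations between the numeric variables, the matrix can be computed as follows. The helper names and the toy values are illustrative only and are not taken from the Saratoga data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations; these are the cells of a heat map."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy values for illustration only; NOT taken from the Saratoga data.
toy = {
    "Price": [150000, 200000, 250000, 320000, 410000],
    "Living.Area": [1100, 1500, 1800, 2300, 2900],
    "Age": [60, 35, 20, 12, 3],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Variables whose row for Price shows a correlation close to +1 or -1 are natural candidates for the regression, which is how a heat map can feed into variable selection.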

Full Model

Stepwise

Log selection

To complete the linear regression analysis and determine whether to apply logarithmic transformations, we examined the selected variables and made the following decisions:

  • Log Land.Value and Living.Area: Both have large ranges and are right-skewed, so logging them improves model stability.

  • No log for Waterfront, Central.Air, New.Construct: These are binary (0/1) variables, so logging is not meaningful.

  • No log for Lot.Size and Bathrooms: Both have small ranges and contain zeros, for which the log is undefined.

    Property Statistics Summary

    Statistic  Lot_Size  Waterfront  Land_Value  Central_Air  New_Construct  Living_Area  Bathrooms
    Min.         0.0000     0.00000         200       0.0000        0.00000          616      0.000
    1st Qu.      0.1700     0.00000       15100       0.0000        0.00000         1300      1.500
    Median       0.3700     0.00000       25000       0.0000        0.00000         1632      2.000
    Mean         0.5003     0.00865       34536       0.3662        0.04671         1753      1.899
    3rd Qu.      0.5400     0.00000       40200       1.0000        0.00000         2134      2.500
    Max.        12.2000     1.00000      412600       1.0000        1.00000         5228      4.500
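The effect of logging on skewness can be illustrated with the quartiles in the summary table. The sketch below uses Bowley's quartile-based skewness, which is a swapped-in measure chosen because it needs only the quartiles and not the raw data; it is not necessarily the check the group used. It also assumes natural logs.

```python
import math

def bowley_skewness(q1, median, q3):
    """Quartile-based skewness: 0 for a symmetric middle half, > 0 for right skew."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Quartiles (1st Qu., Median, 3rd Qu.) copied from the Property Statistics Summary.
quartiles = {
    "Land.Value": (15100, 25000, 40200),
    "Living.Area": (1300, 1632, 2134),
}

skew = {}
for name, (q1, med, q3) in quartiles.items():
    raw = bowley_skewness(q1, med, q3)
    # log is monotone, so the log of a quartile is (approximately) the quartile
    # of the logged variable.
    logged = bowley_skewness(math.log(q1), math.log(med), math.log(q3))
    skew[name] = (raw, logged)
    print(f"{name}: raw {raw:.3f}, logged {logged:.3f}")

# Variables whose minimum is 0 (Lot_Size, Bathrooms) cannot be logged directly.
try:
    math.log(0.0)
except ValueError as err:
    print("log(0) fails:", err)
```

For both variables the logged quartiles are closer to symmetric than the raw ones, which is consistent with the decision to log Land.Value and Living.Area.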

Model Selection

To examine the relationship between Price and the selected variables, we fitted and compared the following four linear regression models:

  1. Linear-Linear Model:
    • Uses the selected independent variables, untransformed, to predict Price.
  2. Linear-Log Model:
    • Logs the independent variables Land.Value and Living.Area.
  3. Log-Linear Model:
    • Logs the dependent variable Price.
  4. Log-Log Model:
    • Logs Land.Value, Living.Area and Price.
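Written out, the four specifications differ only in which terms are logged. The sketch below assumes that all four models use the same six selected predictors and collects the terms that are never logged into a shorthand Z (a symbol introduced here for illustration). The coefficients take different values in each model.

\[\begin{aligned} Z &= \beta_1 \text{Lot.Size} + \beta_2 \text{Waterfront} + \beta_4 \text{New.Construct} + \beta_6 \text{Bathrooms} \\ \text{Linear-Linear:} \quad \text{Price} &= \beta_0 + Z + \beta_3 \text{Land.Value} + \beta_5 \text{Living.Area} \\ \text{Linear-Log:} \quad \text{Price} &= \beta_0 + Z + \beta_3 \log(\text{Land.Value}) + \beta_5 \log(\text{Living.Area}) \\ \text{Log-Linear:} \quad \log(\text{Price}) &= \beta_0 + Z + \beta_3 \text{Land.Value} + \beta_5 \text{Living.Area} \\ \text{Log-Log:} \quad \log(\text{Price}) &= \beta_0 + Z + \beta_3 \log(\text{Land.Value}) + \beta_5 \log(\text{Living.Area}) \end{aligned}\]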

Model Performance Summary

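For the final model, we chose the Log-Log model. Among the four models it had the lowest RMSE and MAE and the second-highest R-squared, although the RMSE and MAE of the two log-price models are measured in log units and are therefore not directly comparable with the dollar-scale errors of the two linear-price models. Between the two log-price models, the Log-Log model is better on all three metrics.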

    Model                  RMSE       R-squared  MAE
    Linear – Linear model  58752.03   0.645126   42041.38
    Linear – Log model     63737.29   0.5837296  45835.12
    Log – Linear model     0.2986343  0.5741902  0.2117078
    Log – Log model        0.2937391  0.5870239  0.2109565
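The three metrics in the table can be computed as in the sketch below. It also shows one way to put a log-price model on the dollar scale: exponentiate its predictions before computing the error. The prices and predictions are hypothetical values chosen for illustration, not results from the project, and natural logs are assumed.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical prices and log-scale predictions, for illustration only.
actual_price = [150000.0, 200000.0, 250000.0, 320000.0]
predicted_log_price = [11.95, 12.25, 12.40, 12.65]

# Error in log units: what a model with log(Price) as the response reports.
log_rmse = rmse([math.log(p) for p in actual_price], predicted_log_price)

# Error in dollars after a naive back-transform. exp() of a predicted log
# price tends to understate the mean price, so this is a rough comparison.
predicted_price = [math.exp(v) for v in predicted_log_price]
dollar_rmse = rmse(actual_price, predicted_price)

print(f"log-scale RMSE: {log_rmse:.4f}")
print(f"dollar-scale RMSE: {dollar_rmse:.2f}")
```

Computing the dollar-scale RMSE and MAE for the log-price models in this way would allow all four rows of the table to be compared on the same footing.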

Full Model Assumption Checking

  1. Linearity: The residuals have an autocorrelation of 0.1701934, which suggests they are close to random, so we treated the linearity assumption as met.
  2. Independence: The D-W statistic is 1.660. Values near 2 indicate no autocorrelation, so this points to mild positive autocorrelation in the residuals.
  3. Homoskedasticity: The residual plot shows heteroskedasticity, so this assumption is violated.
  4. Normality: Most points follow the reference line in the Q-Q plot, but there is a spike at the end and a drop in the tail.

Final Model Assumption Checking

  1. Linearity: The residuals have an autocorrelation of 0.225.
  2. Independence: The D-W statistic is 1.547, which is below 2 and points to mild positive autocorrelation in the residuals.
  3. Homoskedasticity: The spread of the residuals grows as points move away from the fitted line, which indicates heteroskedasticity.
  4. Normality: More points follow the reference line in the Q-Q plot than for the full model.
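The Durbin-Watson statistic used in the independence checks can be computed directly from the residuals, as in the minimal sketch below. It also checks the reported values against the standard approximation DW ≈ 2(1 − r), under the assumption that the reported autocorrelations are lag-1 autocorrelations of the residuals. The function names are illustrative.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation, assuming the residuals already have mean zero."""
    num = sum(residuals[t] * residuals[t - 1] for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Reported (autocorrelation, D-W) pairs from the assumption-checking slides.
reported = [("full model", 0.1701934, 1.660), ("final model", 0.225, 1.547)]

checks = []
for label, r1, dw in reported:
    approx = 2 * (1 - r1)  # DW is approximately 2 * (1 - lag-1 autocorrelation)
    checks.append((label, approx, dw))
    print(f"{label}: 2(1 - r) = {approx:.3f}, reported D-W = {dw}")
```

A statistic of 2 corresponds to no lag-1 autocorrelation; values below 2 correspond to positive autocorrelation and values above 2 to negative autocorrelation.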

Final Model Visualization

Final Model Interpretation

The general form of the log-log regression equation is:

\[\begin{aligned} \log(\text{Price}) &= \beta_0 + \beta_1 (\text{Lot.Size}) + \beta_2 \text{Waterfront} \\ &\quad + \beta_3 \log(\text{Land.Value}) + \beta_4 \text{New.Construct} \\ &\quad + \beta_5 \log(\text{Living.Area}) + \beta_6 \text{Bathrooms} \end{aligned}\]

Substituting the estimated coefficients, the formula becomes:

\[\begin{aligned} \log(\text{Price}) &= 6.416188 + 0.036575 \, (\text{Lot.Size}) + 0.540571 \, (\text{Waterfront}) \\ &\quad + 0.122890 \, (\log(\text{Land.Value})) - 0.063545 \, (\text{New.Construct}) \\ &\quad + 0.571363 \, (\log(\text{Living.Area})) + 0.138367 \, (\text{Bathrooms}) \end{aligned}\]
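To make the fitted equation concrete, the sketch below turns it into a price prediction and converts coefficients into percentage effects. It assumes that log denotes the natural logarithm; the function names are illustrative, and the example house uses the medians from the Property Statistics Summary.

```python
import math

# Coefficients copied from the fitted log-log equation.
COEF = {
    "intercept": 6.416188,
    "lot_size": 0.036575,
    "waterfront": 0.540571,
    "log_land_value": 0.122890,
    "new_construct": -0.063545,
    "log_living_area": 0.571363,
    "bathrooms": 0.138367,
}

def predict_price(lot_size, waterfront, land_value, new_construct, living_area, bathrooms):
    """Predicted price in USD, assuming natural logs in the fitted equation."""
    log_price = (
        COEF["intercept"]
        + COEF["lot_size"] * lot_size
        + COEF["waterfront"] * waterfront
        + COEF["log_land_value"] * math.log(land_value)
        + COEF["new_construct"] * new_construct
        + COEF["log_living_area"] * math.log(living_area)
        + COEF["bathrooms"] * bathrooms
    )
    return math.exp(log_price)

def percent_effect(coefficient, change=1.0):
    """Percentage change in Price when an unlogged predictor rises by `change`."""
    return (math.exp(coefficient * change) - 1) * 100

# Example house built from the medians in the summary table.
base = predict_price(0.37, 0, 25000, 0, 1632, 2.0)
on_water = predict_price(0.37, 1, 25000, 0, 1632, 2.0)

print(f"median-style house: {base:,.0f} USD")
print(f"same house on the waterfront: {on_water:,.0f} USD")
print(f"waterfront effect: {percent_effect(COEF['waterfront']):.1f}%")
print(f"one extra bathroom: {percent_effect(COEF['bathrooms']):.1f}%")
```

Under the same assumption, the coefficients on the logged predictors read as elasticities: a 1% increase in Living.Area is associated with roughly a 0.57% increase in Price, and a 1% increase in Land.Value with roughly a 0.12% increase, holding the other variables fixed.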

Limitations and bias in the data

  • Multicollinearity: The high correlation between living area and bedrooms (coefficient of 0.73) can inflate standard errors, which makes it difficult to assess the significance of individual variables.

  • Violation of assumptions

    • Independence: The D-W statistic of 1.5472 suggests slight positive autocorrelation between residuals.

    • Normality: Outliers in the dataset affected the Q-Q plot, especially at the tails, indicating a deviation from a normal distribution.

    • Homoskedasticity: Residual plots indicate heteroskedasticity: the variance of the errors increases with certain values, violating the assumption of constant variance.

  • Model Complexity: Including multiple variables increases the complexity of the model, making it harder to interpret and potentially overfitting it to this specific dataset.

  • Data Limitations:

    • The data used is from a specific geographic location (Saratoga County, NY), limiting the generalizability of the model to other regions or time periods.

    • Some important variables (e.g., economic factors like interest rates, inflation, or proximity to amenities) may not have been included, which could impact property prices.
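The size of the multicollinearity effect noted above can be gauged with a variance inflation factor (VIF). As a rough worked example, assuming living area and bedrooms were the only two predictors involved, the pairwise correlation of 0.73 gives:

\[\text{VIF} = \frac{1}{1 - r^2} = \frac{1}{1 - 0.73^2} = \frac{1}{0.4671} \approx 2.14\]

In that simplified setting the variance of each coefficient would be about 2.1 times larger than with uncorrelated predictors, so its standard error would be about 1.5 times larger. With more predictors, the r² in the formula is replaced by the R-squared from regressing one predictor on all the others.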

Future Implications of Model

  • Property Valuation:

    • Can be used by real estate platforms, governments, and investors for accurate property value assessments.
  • Mortgage Lending & Risk Assessment:

    • Lenders can use these models to assess mortgage eligibility and potential risks.
  • Decision-Making for Stakeholders:

    • Buyers, sellers, and investors can leverage the model for better decision-making, especially regarding features like waterfront properties or new construction.
  • Market Trend Insights:

    • Provides insights into housing market trends and informs policy development.

    • The effect of a waterfront location or new construction on investment value could be investigated further.

Summary

  • The modelling aimed to predict house prices using data from Saratoga County, New York

  • The final model chosen, a log-log regression, performed the best in terms of RMSE and MAE

  • Key findings showed that Waterfront, Living.Area, and Bathrooms were significant predictors

  • The formula derived can predict property valuation, assisting in a variety of applications:

    • Property valuation, mortgage lending, risk assessment

    • Decision-making support for real estate professionals, buyers and sellers

Thank you very much!

  • Q&A